HDDS-13891. SCM-based health monitoring and batch processing in Recon #9258

sumitagrawl merged 45 commits into apache:master
Conversation
…to run the replication manager logic in Recon itself.
dombizita left a comment
Thank you for working on this @devmadhuu! This is quite a big PR; I went through mostly the REPLICA_MISMATCH related changes (which look good), but also tried to go over the whole change set, except the test changes.
Two overall comments: I believe you have a design document for this solution, so please share it on the Jira and in the PR description too. Also, if you used any kind of AI tool, please mention it in the description, as the ASF asks for that. Thanks!
 * @param container The container ID to record
 */
@Override
public void incrementAndSample(ContainerHealthState stat, ContainerInfo container) {
We can remove the other methods and just have two methods for the replica-mismatch state.
As discussed, since Recon's replication manager maintains its own map, the other methods in this class are needed.
ArafatKhan2198 left a comment
Some comments on the patch @devmadhuu
    Set<Long> negativeSizeRecorded,
    ProcessingStats stats) throws ContainerNotFoundException {
  switch (state) {
    case MISSING:
SCM's handler chain can emit composite health states like:
UNHEALTHY_UNDER_REPLICATED
QUASI_CLOSED_STUCK_MISSING
QUASI_CLOSED_STUCK_UNDER_REPLICATED
MISSING_UNDER_REPLICATED
etc.
But the switch statement in handleScmStateContainer() only handles 4 states: MISSING, UNDER_REPLICATED, OVER_REPLICATED, MIS_REPLICATED. Everything else falls into default: break; and is silently thrown away.
This means a container that is both quasi-closed-stuck AND has no replicas (QUASI_CLOSED_STUCK_MISSING) will never appear in the Recon UI or the UNHEALTHY_CONTAINERS table.
Are these composite states intentionally excluded from V2? Or should we map them to their base state (e.g., QUASI_CLOSED_STUCK_UNDER_REPLICATED → store as UNDER_REPLICATED with an appropriate reason string)?
This is a good point. We can map these composites onto storable base states in Recon without changing the DB enum/schema:
UNHEALTHY_UNDER_REPLICATED -> UNDER_REPLICATED
UNHEALTHY_OVER_REPLICATED -> OVER_REPLICATED
QUASI_CLOSED_STUCK_MISSING -> MISSING
QUASI_CLOSED_STUCK_UNDER_REPLICATED -> UNDER_REPLICATED
QUASI_CLOSED_STUCK_OVER_REPLICATED -> OVER_REPLICATED
MISSING_UNDER_REPLICATED -> both MISSING and UNDER_REPLICATED
This keeps compatibility with current UNHEALTHY_CONTAINERS allowed states and avoids silent loss.
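A minimal, self-contained sketch of the proposed mapping. The method name mapToBaseStates and the String-based states are illustrative only; the real code would switch over SCM's health-state enum rather than strings.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: fold SCM's composite health states into the base
// states Recon's UNHEALTHY_CONTAINERS schema already allows.
public final class CompositeStateMapper {

  static List<String> mapToBaseStates(String scmState) {
    switch (scmState) {
      case "UNHEALTHY_UNDER_REPLICATED":
      case "QUASI_CLOSED_STUCK_UNDER_REPLICATED":
        return Collections.singletonList("UNDER_REPLICATED");
      case "UNHEALTHY_OVER_REPLICATED":
      case "QUASI_CLOSED_STUCK_OVER_REPLICATED":
        return Collections.singletonList("OVER_REPLICATED");
      case "QUASI_CLOSED_STUCK_MISSING":
        return Collections.singletonList("MISSING");
      case "MISSING_UNDER_REPLICATED":
        // store under both base states so neither signal is lost
        return Arrays.asList("MISSING", "UNDER_REPLICATED");
      default:
        // already a base state (or unknown; the caller logs it)
        return Collections.singletonList(scmState);
    }
  }

  public static void main(String[] args) {
    System.out.println(mapToBaseStates("QUASI_CLOSED_STUCK_MISSING")); // [MISSING]
  }
}
```

Because unknown states fall through the default branch unchanged, the caller can still detect and log anything outside the allowed set before writing to the DB.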
      handleMissingContainer(containerId, currentTime,
          existingInStateSinceByContainerAndState, recordsToInsert, stats);
      break;
    case UNDER_REPLICATED:
Shouldn't we have a state for REPLICA_MISMATCH also?
Good catch. Yes, it will be added.
    List<UnhealthyContainerRecordV2> recordsToInsert,
    Set<Long> negativeSizeRecorded,
    ProcessingStats stats) throws ContainerNotFoundException {
  switch (state) {
Remember, in our offline discussion we checked what happens if we attempt to add a state to the Derby table that violates the allowed-state constraint. I believe this switch case will prevent a new state from being added to the database.
Yes, the switch case will prevent a new state from being added to the database. Unsupported/new SCM states will be detected and logged (not silently dropped), while DB writes remain constrained to Recon's allowed enum states only.
healthSchemaManager.batchDeleteSCMStatesForContainers(containerIdsToDelete);

LOG.info("Inserting {} unhealthy container records", recordsToInsert.size());
healthSchemaManager.insertUnhealthyContainerRecords(recordsToInsert);
}
healthSchemaManager.batchDeleteSCMStatesForContainers(containerIdsToDelete); // step 1
healthSchemaManager.insertUnhealthyContainerRecords(recordsToInsert); // step 2
These two calls are not in the same transaction. If Recon crashes between step 1 and step 2, all old health data is gone but the new data was never written. The API would return 0 unhealthy containers until the next scan runs. Also, if someone queries the API between step 1 and step 2, they get empty or partial results.
Yes, this will be handled as part of the same comment from Sumit above.
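To illustrate why the delete and the insert need to share one transaction, here is a minimal pure-Java sketch that models the table as an in-memory list and restores a snapshot on failure. All names are hypothetical; in the actual code the same guarantee would come from running both jOOQ calls inside a single transaction rather than this hand-rolled rollback.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Sketch: if the "crash" happens between delete and insert, rollback keeps
// the old rows visible instead of leaving the table empty.
public final class AtomicReplaceSketch {
  static <T> void inTransaction(List<T> table, Consumer<List<T>> work) {
    List<T> snapshot = new ArrayList<>(table); // pre-transaction state
    try {
      work.accept(table);                      // step 1 + step 2 together
    } catch (RuntimeException e) {
      table.clear();                           // "rollback": restore snapshot
      table.addAll(snapshot);
      throw e;
    }
  }

  public static void main(String[] args) {
    List<String> table = new ArrayList<>(Arrays.asList("old1", "old2"));
    try {
      inTransaction(table, t -> {
        t.clear();                             // batch delete (step 1)
        throw new RuntimeException("crash before insert (step 2)");
      });
    } catch (RuntimeException ignored) {
      // readers never observe the empty intermediate state
    }
    System.out.println(table); // [old1, old2]
  }
}
```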
for (int from = 0; from < allContainers.size(); from += PERSIST_CHUNK_SIZE) {
  int to = Math.min(from + PERSIST_CHUNK_SIZE, allContainers.size());
  List<Long> chunkContainerIds = collectContainerIds(allContainers, from, to);
This collects every single container ID in the cluster (healthy and unhealthy) and runs DELETE statements for all of them. On a cluster with 1 million containers, that means:
- Allocating a list with 1M entries
- Running 1,000 chunked DELETE statements
- Most of those containers are healthy and have no rows in the table, so the DELETEs are wasted work
Suggestion: Instead, just delete all rows by state: DELETE FROM UNHEALTHY_CONTAINERS WHERE container_state IN (...) — one statement, no chunking, much faster.
We no longer delete by passing all container IDs blindly. For each chunk, we first load the existing rows and build existingContainerIdsToDelete, so DELETE is issued only for container IDs that actually have persisted unhealthy rows in the DB. We also process in bounded chunks (PERSIST_CHUNK_SIZE), so we never hold a single list of 1M container IDs at once.
Another reason we did not switch to DELETE ... WHERE container_state IN (...) is that it would clear all unhealthy rows globally before reinsert, which would also remove the in_state_since replica history. See @dombizita's comments.
// Call inherited processContainer - this runs SCM's health check chain
// readOnly=true ensures no commands are generated
processContainer(container, nullQueue, report, true);
Minor suggestion:
processContainer(container, nullQueue, report, true); // calls getContainerReplicas() internally
Set<ContainerReplica> replicas = containerManager.getContainerReplicas(cid); // calls again
Every container's replicas are fetched twice: once inside the inherited processContainer() and once for the REPLICA_MISMATCH check. On a 1 million container cluster, that's 2M replica lookups. We can fetch the replicas once and pass them to both operations.
Good catch. This one is worth a real code change, not just a reply: I'll remove the duplicate replica fetch by adding a ReplicationManager overload that accepts pre-fetched replicas and then use it from Recon.
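A rough sketch of the single-fetch pattern. The counter stands in for the real containerManager.getContainerReplicas(cid) call, and every method and class name here is hypothetical; the point is only that one lookup feeds both the health chain and the mismatch check.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: fetch replicas once, then pass the same set to the inherited
// health-check overload and to the REPLICA_MISMATCH check.
public final class SingleFetchSketch {
  static final AtomicInteger LOOKUPS = new AtomicInteger();

  static Set<String> fetchReplicas(long containerId) {
    LOOKUPS.incrementAndGet(); // stands in for getContainerReplicas(cid)
    return new HashSet<>(Arrays.asList("dn1", "dn2", "dn3"));
  }

  // hypothetical overload taking pre-fetched replicas
  static void processContainer(long cid, Set<String> replicas) { /* SCM chain */ }

  static boolean replicaMismatch(Set<String> replicas) {
    return false; // checksum comparison elided in this sketch
  }

  public static void main(String[] args) {
    long cid = 42L;
    Set<String> replicas = fetchReplicas(cid); // one lookup...
    processContainer(cid, replicas);           // ...reused here
    boolean mismatch = replicaMismatch(replicas); // ...and here
    System.out.println(LOOKUPS.get() + " lookup(s), mismatch=" + mismatch);
  }
}
```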
dslContext.createIndex("idx_state_container_id")
    .on(DSL.table(UNHEALTHY_CONTAINERS_TABLE_NAME),
        DSL.field(name(CONTAINER_STATE)),
        DSL.field(name(CONTAINER_ID)))
if (!TABLE_EXISTS_CHECK.test(conn, UNHEALTHY_CONTAINERS_TABLE_NAME)) {
createUnhealthyContainersTable(); // creates table + composite index
}
The composite index idx_state_container_id is created inside createUnhealthyContainersTable(), and this method is only called if the table doesn't exist.
If someone upgrades an existing Recon deployment, the UNHEALTHY_CONTAINERS table already exists (from V1), so this entire method is skipped. The new composite index is never created on existing clusters; they're stuck with the old single-column index and get none of the 43–67× performance improvement.
Handled using a new upgrade action.
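As a sketch of the idea (not Ozone's actual upgrade framework, whose classes and names differ), an upgrade action keyed by schema version runs exactly once on deployments that predate it, which is how an existing V1 table can still get the new index:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical version-gated upgrade runner: actions newer than the stored
// schema version execute once on startup, e.g. creating the composite index
// on clusters whose UNHEALTHY_CONTAINERS table predates it.
public final class UpgradeActionSketch {
  interface UpgradeAction { void execute(); }

  static int runPendingActions(int currentVersion,
      SortedMap<Integer, UpgradeAction> actions) {
    int newVersion = currentVersion;
    for (Map.Entry<Integer, UpgradeAction> e
        : actions.tailMap(currentVersion + 1).entrySet()) {
      e.getValue().execute();  // e.g. CREATE INDEX idx_state_container_id ...
      newVersion = e.getKey(); // persisted as the new schema version
    }
    return newVersion;
  }

  public static void main(String[] args) {
    SortedMap<Integer, UpgradeAction> actions = new TreeMap<>();
    List<String> executed = new ArrayList<>();
    actions.put(1, () -> executed.add("create UNHEALTHY_CONTAINERS"));
    actions.put(2, () -> executed.add("create idx_state_container_id"));
    // Existing deployment already at version 1: only the index action runs.
    int v = runPendingActions(1, actions);
    System.out.println(v + " " + executed);
  }
}
```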
long now = System.currentTimeMillis();
long insertStart = System.nanoTime();

for (int startId = 1; startId <= CONTAINER_ID_RANGE; startId += CONTAINERS_PER_TX) {
Current code; check if it can be optimized or changed:
start tx
  delete 0.5 million
  loop over 1 million in batches:
    batch insert 1k
  end loop
end tx
Updated the tests. Below is the perf data:
2026-03-23 17:47:28,930 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testBatchInsertOneMillionRecords(290)) - --- Test 1: Batch INSERT 1000000 records (2000 containers/tx, 100 transactions) ---
2026-03-23 17:47:43,984 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testBatchInsertOneMillionRecords(304)) - Batch INSERT complete: 1000000 records in 15037 ms (66503 rec/sec, 100 tx)
2026-03-23 17:47:43,993 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testTotalInsertedRecordCountIsOneMillion(319)) - --- Test 2: Verify total row count = 1000000 ---
2026-03-23 17:47:44,267 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testTotalInsertedRecordCountIsOneMillion(325)) - COUNT(*) = 1000000 rows in 274 ms
2026-03-23 17:47:44,270 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testCountByStatePerformanceUsesIndex(345)) - --- Test 3: COUNT(*) by state (index-covered, 200000 records each) ---
2026-03-23 17:47:44,361 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testCountByStatePerformanceUsesIndex(359)) - COUNT(UNDER_REPLICATED) = 200000 rows in 91 ms
2026-03-23 17:47:44,495 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testCountByStatePerformanceUsesIndex(359)) - COUNT(MISSING) = 200000 rows in 133 ms
2026-03-23 17:47:44,624 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testCountByStatePerformanceUsesIndex(359)) - COUNT(OVER_REPLICATED) = 200000 rows in 128 ms
2026-03-23 17:47:44,687 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testCountByStatePerformanceUsesIndex(359)) - COUNT(MIS_REPLICATED) = 200000 rows in 62 ms
2026-03-23 17:47:44,722 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testCountByStatePerformanceUsesIndex(359)) - COUNT(EMPTY_MISSING) = 200000 rows in 33 ms
2026-03-23 17:47:44,723 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testGroupBySummaryQueryPerformance(385)) - --- Test 4: GROUP BY summary over 1000000 rows ---
2026-03-23 17:47:45,271 [ForkJoinPool-1-worker-1] INFO impl.Tools (JooqLogger.java:info(338)) - Kotlin is available, but not kotlin-reflect. Add the kotlin-reflect dependency to better use Kotlin features like data classes
2026-03-23 17:47:45,272 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testGroupBySummaryQueryPerformance(392)) - GROUP BY summary: 5 state groups returned in 548 ms
2026-03-23 17:47:45,272 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:lambda$testGroupBySummaryQueryPerformance$0(395)) - state=EMPTY_MISSING count=200000
2026-03-23 17:47:45,273 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:lambda$testGroupBySummaryQueryPerformance$0(395)) - state=MISSING count=200000
2026-03-23 17:47:45,273 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:lambda$testGroupBySummaryQueryPerformance$0(395)) - state=MIS_REPLICATED count=200000
2026-03-23 17:47:45,273 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:lambda$testGroupBySummaryQueryPerformance$0(395)) - state=OVER_REPLICATED count=200000
2026-03-23 17:47:45,273 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:lambda$testGroupBySummaryQueryPerformance$0(395)) - state=UNDER_REPLICATED count=200000
2026-03-23 17:47:45,274 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testPaginatedReadByStatePerformance(431)) - --- Test 5: Paginated read of UNDER_REPLICATED (200000 records, page size 5000) ---
2026-03-23 17:47:45,779 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testPaginatedReadByStatePerformance(470)) - Paginated read: 200000 records in 40 pages, 504 ms (200000 rec/sec)
2026-03-23 17:47:45,781 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testFullDatasetReadThroughputAllStates(496)) - --- Test 6: Full 1 M record read (all states, paged) ---
2026-03-23 17:47:46,155 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testFullDatasetReadThroughputAllStates(524)) - State UNDER_REPLICATED: 200000 records in 373 ms
2026-03-23 17:47:46,532 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testFullDatasetReadThroughputAllStates(524)) - State MISSING: 200000 records in 376 ms
2026-03-23 17:47:46,915 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testFullDatasetReadThroughputAllStates(524)) - State OVER_REPLICATED: 200000 records in 382 ms
2026-03-23 17:47:47,275 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testFullDatasetReadThroughputAllStates(524)) - State MIS_REPLICATED: 200000 records in 359 ms
2026-03-23 17:47:47,647 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testFullDatasetReadThroughputAllStates(524)) - State EMPTY_MISSING: 200000 records in 371 ms
2026-03-23 17:47:47,648 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testFullDatasetReadThroughputAllStates(532)) - Full dataset read: 1000000 total records in 1865 ms (536193 rec/sec)
2026-03-23 17:47:47,650 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testAtomicReplaceDeleteAndInsertInSingleTransaction(562)) - --- Test 7: Atomic replace — 200000 IDs × 5 states = 1000000 rows in one tx ---
2026-03-23 17:49:06,774 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testAtomicReplaceDeleteAndInsertInSingleTransaction(575)) - Atomic replace completed in 79103 ms
2026-03-23 17:49:07,003 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testBatchDeletePerformanceOneMillionRecords(626)) - --- Test 8: Batch DELETE — 200000 IDs × 5 states = 1000000 rows (200 internal SQL statements of 1000 IDs) ---
2026-03-23 17:50:11,884 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testBatchDeletePerformanceOneMillionRecords(644)) - DELETE complete: 200000 IDs (1000000 rows) in 64881 ms via 200 SQL statements (15413 rows/sec)
2026-03-23 17:50:11,917 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testBatchDeletePerformanceOneMillionRecords(649)) - Rows remaining after delete: 0 (expected 0)
2026-03-23 17:50:11,918 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testCountByStateAfterFullDelete(672)) - --- Test 9: COUNT by state after full delete (expected 0 each) ---
2026-03-23 17:50:11,934 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testCountByStateAfterFullDelete(686)) - COUNT(UNDER_REPLICATED) = 0 rows in 15 ms
2026-03-23 17:50:11,951 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testCountByStateAfterFullDelete(686)) - COUNT(MISSING) = 0 rows in 16 ms
2026-03-23 17:50:11,966 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testCountByStateAfterFullDelete(686)) - COUNT(OVER_REPLICATED) = 0 rows in 15 ms
2026-03-23 17:50:11,968 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testCountByStateAfterFullDelete(686)) - COUNT(MIS_REPLICATED) = 0 rows in 1 ms
2026-03-23 17:50:11,970 [ForkJoinPool-1-worker-1] INFO persistence.TestUnhealthyContainersDerbyPerformance (TestUnhealthyContainersDerbyPerformance.java:testCountByStateAfterFullDelete(686)) - COUNT(EMPTY_MISSING) = 0 rows in 1 ms
ArafatKhan2198 left a comment
Thanks for the changes @devmadhuu
LGTM +1
What changes were proposed in this pull request?
This PR implements ContainerHealthTaskV2 by extending SCM's ReplicationManager for use in Recon. This approach evaluates container health locally using SCM's proven health check logic without requiring network communication between SCM and Recon.

Design
https://docs.google.com/document/d/1iea0eC4IpPa4Qpmc47Ae3KyneFCZ_fyyuhbZwqrR3cM/edit?pli=1&tab=t.0#heading=h.986yaoz7wnxv
Implementation Approach

Introduces ContainerHealthTaskV2, a new implementation that determines container health states by:
- extending SCM's ReplicationManager as ReconReplicationManager;
- running processAll() to evaluate all containers using SCM's proven health check logic.

Container Health States Detected
ContainerHealthTaskV2 detects 5 distinct health states:
SCM Health States (Inherited)
Recon-Specific Health State
Implementation: ReconReplicationManager first runs SCM's health checks, then additionally checks for REPLICA_MISMATCH by comparing checksums across replicas. This ensures both replication health and data integrity are monitored.
Testing
UNHEALTHY_CONTAINERS table

Database Schema
Uses the existing UNHEALTHY_CONTAINERS_V2 table with support for all 5 health states.

Each record includes:
Some code optimizations in this PR for Recon's ContainerHealthTask are done using the Cursor AI tool.

What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-13891
How was this patch tested?
Added JUnit test cases and tested using a local docker cluster.
Recon UNHEALTHY_CONTAINERS Table — Performance Optimisations

Summary
This PR also improves the read throughput of the UNHEALTHY_CONTAINERS Derby table by 43–67×, fixes a latent ERROR XBCM4 crash that would occur on any cluster large enough to trigger a >2,000-container DELETE in one statement, and removes a redundant Java-side sort that was executing on every paginated API response.
Changes

1. New composite index — ContainerSchemaDefinition.java

Old index:

New index:
Why this matters for paginated reads:

With the old single-column index, Derby had to:
- scan every entry matching container_state = ? (up to 200K entries);
- sort by container_id on every single page call — an O(n) operation repeated once per page.

With the composite index, Derby jumps directly to (state, minContainerId) and reads the next LIMIT entries sequentially — O(1) per page regardless of cursor position or total row count. The composite index also covers COUNT(*) WHERE container_state = ? and GROUP BY container_state queries via its leading column prefix, so those queries retain their index-only access path.
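The seek-style page read this composite index enables can be sketched in plain Java. The NavigableSet below stands in for the index ordering on (container_state, container_id); the equivalent SQL would be WHERE container_state = ? AND container_id > ? ORDER BY container_id with a LIMIT. All names are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Sketch of keyset ("seek") pagination: resume after the cursor directly,
// without rescanning or re-sorting the whole state partition per page.
public final class KeysetPageSketch {
  static List<Long> page(NavigableSet<Long> idsInState, long afterId, int limit) {
    List<Long> out = new ArrayList<>(limit);
    // tailSet seeks past the cursor in O(log n), like the index seek
    for (Long id : idsInState.tailSet(afterId, false)) {
      if (out.size() == limit) {
        break;                       // the LIMIT
      }
      out.add(id);                   // sequential read of the next entries
    }
    return out;
  }

  public static void main(String[] args) {
    NavigableSet<Long> ids = new TreeSet<>(Arrays.asList(1L, 3L, 5L, 7L, 9L));
    System.out.println(page(ids, 3L, 2)); // [5, 7]
  }
}
```

The caller passes the last container_id of the previous page as afterId, which is exactly the minContainerId cursor the API already exposes.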
2. ContainerHealthSchemaManagerV2.getUnhealthyContainers() — two fixes

a) Conditional Java-side sort removed for forward pagination

For the common forward-pagination path (minContainerId), the SQL ORDER BY container_id ASC already delivers sorted rows. The redundant Java sort was calling Comparator.comparingLong on every page (up to 200 pages per state per request).

b) JDBC fetch-size hint added
Derby's default JDBC fetch size is 1 row per wire call. For a 5,000-row page this meant 5,000 individual JDBC fetch round-trips inside the driver before any data reached the application layer. Setting fetchSize(limit) pre-buffers the full page in a single JDBC call.
3. ContainerHealthSchemaManagerV2.batchDeleteSCMStatesForContainers() — internal chunking

Bug fixed: Derby's SQL compiler generates a Java class per prepared statement. A WHERE container_id IN (N values) predicate combined with the 7-state container_state IN (…) predicate generates an expression tree whose compiled bytecode can exceed the JVM 65,535-byte per-method limit (ERROR XBCM4). The method previously delegated chunking to callers. On a large cluster ReconReplicationManager.persistUnhealthyRecords() passes the full container list in one call — which would crash Derby on any cluster with >2,000 containers in a single scan batch.

Fix: the method now chunks internally at MAX_DELETE_CHUNK_SIZE = 1,000 IDs per SQL statement. Callers pass any size list; the method is safe by construction.
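The internal chunking can be sketched as follows. MAX_DELETE_CHUNK_SIZE mirrors the value named in the PR; the jOOQ DELETE itself is elided, and each sublist would feed one DELETE ... WHERE container_id IN (...) statement.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: split an arbitrarily large ID list so every generated
// DELETE ... IN (...) stays under Derby's compiled-statement size limit.
public final class ChunkSketch {
  static final int MAX_DELETE_CHUNK_SIZE = 1000;

  static List<List<Long>> chunks(List<Long> ids) {
    List<List<Long>> out = new ArrayList<>();
    for (int from = 0; from < ids.size(); from += MAX_DELETE_CHUNK_SIZE) {
      int to = Math.min(from + MAX_DELETE_CHUNK_SIZE, ids.size());
      out.add(ids.subList(from, to)); // one DELETE statement per sublist
    }
    return out;
  }

  public static void main(String[] args) {
    List<Long> ids = new ArrayList<>();
    for (long i = 0; i < 2500; i++) {
      ids.add(i);
    }
    // 2,500 IDs => 3 statements of 1000 + 1000 + 500
    System.out.println(chunks(ids).size());
  }
}
```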
Performance Test (Added in last commit)

A new test class TestUnhealthyContainersDerbyPerformance benchmarks all operations at 1 million records (5 states × 200,000 container IDs).

Test environment: macOS 14 (Apple M-series), JDK 8, Derby 10.14 embedded, derby.storage.pageCacheSize = 20,000 (~80 MB page cache).

Results — baseline vs. optimised

[Results table: baseline vs. optimised timings for COUNT(*) total, COUNT by state (avg of 5), GROUP BY summary (all states), and COUNT by state after delete (avg)]

Raw logs

Optimised run output